Customer Segmentation and Market Basket Analysis

Introduction: This notebook we are using public dataset On-line Retail to explore customer segmentation. Then we go further and apply association rule mining approach to find rules and patterns in this transaction database. These customer segmentation, rules and patterns can be used to make intsightful and useful decisions as far as users' and retailers' interest is concerned.

In [1]:

Load Dataset

In [2]:
In [3]:
In [4]:
Out[4]:
Unnamed: 0              int64
invoiceno              object
stockcode              object
description            object
quantity                int64
invoicedate    datetime64[ns]
unitprice             float64
customerid              int64
country                object
cancel                   bool
amount                float64
dtype: object

Customer Segmentation:

Customer segmentation is similarly the process of dividing an organization’s customer bases into different sections or segments based on various customer attributes. The process of customer segmentation is based on the premise of finding differences among the customers’ behavior and patterns.

The major objectives and benefits behind the motivation for customer segmentation are:

  • Higher Revenue
  • Customer Understanding
  • Target Marketing
  • Finding Latent Customer Segments

RFM Model for Customer Value:

Since our dataset is limited to the sales records, and didn´t include another information about our customers, we will use a RFM,*Recency, Frequency and Monetary Value, based model of customer value for finding our customer segments. The RFM model will take the transactions of a customer and calculate three important informational attributes about each customer:

  • Recency: The value of how recently a customer purchased at the establishment
  • Frequency: How frequent the customer’s transactions are at the establishment
  • Monetary value: The dollar (or pounds in our case) value of all the transactions that the customer made at the establishment
In [5]:
Reference Date: 2011-12-10 12:50:00
Out[5]:
count mean std min 25% 50% 75% max
customerid 4338.0 15300.408022 1721.808492 12346.0 13813.25 15299.5 16778.75 18287.0
recency 4338.0 92.536422 100.014169 1.0 18.00 51.0 142.00 374.0

We will plot the Recency Distribution and QQ-plot to identify substantive departures from normality, likes outliers, skewness and kurtosis.

In [6]:

With positive skewness, we confirm the lack of symmetry and indicate that sales recency are skewed right.

Frequency

In [7]:

With positive skewness, we confirm the high lack of symmetry and with Kurtosis indicates that is a heavy-tailed distribution and has outliers.

Monetary Value

In [8]:

With positive skewness, we confirm the high lack of symmetry and with Kurtosis indicates that is a too heavy-tailed distribution and has outliers.

Let's see a statistical summary of this dataset:

In [9]:
Out[9]:
customerid recency frequency amount
count 4338.000000 4338.000000 4338.000000 4338.000000
mean 15300.408022 92.536422 4.272015 2048.688081
std 1721.808492 100.014169 7.697998 8985.230220
min 12346.000000 1.000000 1.000000 3.750000
25% 13813.250000 18.000000 1.000000 306.482500
50% 15299.500000 51.000000 2.000000 668.570000
75% 16778.750000 142.000000 5.000000 1660.597500
max 18287.000000 374.000000 209.000000 280206.020000

Data Preprocessing

In [10]:
Out[10]:
count mean std min 25% 50% 75% max
amount_log 4338.0 1.181627e-16 1.000115 -4.172381 -0.682096 -0.063857 0.657274 4.722173
recency_log 4338.0 -1.027980e-16 1.000115 -2.630445 -0.612424 0.114707 0.829652 1.505796
frequency_log 4338.0 -2.355833e-16 1.000115 -1.048610 -1.048610 -0.279044 0.738267 4.882714
In [11]:

The obvious patterns we can see from the plots above is that costumers who buy with a higher frequency and more recency tend to spend more based on the increasing trend in Monetary (amount value) with a corresponding increasing and decreasing trend for Frequency and Recency, respectively.

Clustering for Segments

In [12]:
The best K sugest:  7

Note that by the Elbow method from a K equal to 3 we already observed low rates of gain in the decay of the distortions with the decrease of K reaching the limit of 10% with the K equal to 7. With this in mind, we will begin to evaluate the options more deeply with 3, and 7, starting with the silhouette analysis.

Silhouette analysis on K-Means clustering

In [13]:

When we look at the results of the clustering process, we can infer some interesting insights:

  • First notice that all K clusters options is valid, because they don't have presence of clusters with below average silhouette scores.
  • All options had a some wide fluctuations in the size of the silhouette plots.

Clusters Center:

Let's look at the cluster center values after returning them to normal values from the log and scaled version.

In [14]:
for 3 clusters the silhouette score is 0.34
Centers of each cluster:
        amount     recency  frequency
0  3928.956555    7.218684   9.997198
1   260.222876  116.191967   1.189203
2  1001.720068   34.138812   3.153595
--------------------------------------------------
for 5 clusters the silhouette score is 0.30
Centers of each cluster:
        amount     recency  frequency
0   211.549673  153.935498   1.084346
1  5330.428536    4.287685  13.077065
2  1959.126040   24.531810   5.433186
3   841.506177  101.012447   2.356421
4   389.672044   13.720296   1.743406
--------------------------------------------------
for 7 clusters the silhouette score is 0.31
Centers of each cluster:
         amount     recency  frequency
0    204.950848  225.600151   1.086187
1   2407.089421   37.803327   6.014123
2    237.781336   36.430512   1.132543
3  10156.379437    4.961015  20.687947
4    813.566173  107.106890   2.280821
5    661.120470   13.733298   2.660937
6   2100.809288    4.425310   6.403276
--------------------------------------------------

Clusters Insights:

With the plots and the center in the correct units, let's see some insights by each clusters groups:

In the three-cluster:

  • The three clusters appears have a good stark differences in the Monetary value of the customer, we will confirm this by a box plot.
  • Cluster 1 is the cluster of high value customer who shops frequently and is certainly an important segment for each business.
  • In the similar way we obtain customer groups with low and medium spends in clusters with labels 0 and 2, respectively.
  • Frequency and Recency correlate perfectly to the Monetary value based on the trend (High Monetary-Low Recency-High Frequency).

In the five-cluster:

  • Note that clusters 0 and 1 are very similar to their cluster in the configuration with only 3 clusters.
  • The cluster 1 appears more robust on the affirmation of those who shop often and with high amount.
  • The cluster 2 are those who have a decent spend but are not as frequent as the cluster 1
  • The cluster 4 purchases medium amounts, with a relatively low frequency and not very recent
  • The cluster 3 makes low-cost purchases, with a relatively low frequency, but above 1, and made their last purchase more recently. This group of customers probably response to price discounts and can be subject to loyalty promotions to try increase the medium-ticket, strategy that can be better defined when we analyzing the market basket.
  • The silhouette score matrix says that the five cluster segments are less optimal then the three cluster segments.

In the five-cluster:

  • Definitely cluster 6 defines those who shop often and with high amount.
  • Clusters 1 and 5 show good spending and good frequency, only deferring in how recent were their last purchases, where 5 is older, which suggests an active action to sell to group 5 as soon as possible and another to 1 seeking to raise its frequency.
  • Cluster 0 presents the fourth best purchase and a reasonable frequency, but this is a long time without buying. This group should be sensible to promotions and activations, so that they do not get lost and make their next purchase.
  • Cluster 5 is similar to 0, but has made its purchases more recently and has a slightly better periodicity. Then actions must be taken to raise their frequency and reduce the chances of them migrating to cluster 0 by staying longer without purchasing products.

Drill Down Clusters:

In [15]:
customerid recency frequency amount recency_log frequency_log amount_log clusters_3 clusters_5 clusters_7
0 12346 326.0 1 77183.60 5.786897 0.000000 11.253942 2 3 1
1 12347 2.0 7 4310.00 0.693147 1.945910 8.368693 0 1 6
2 12348 75.0 4 1797.24 4.317488 1.386294 7.494007 2 2 1
3 12349 19.0 1 1757.55 2.944439 0.000000 7.471676 2 4 5
4 12350 310.0 1 334.40 5.736572 0.000000 5.812338 1 0 0
In [16]:
Cluster 0Cluster 1Cluster 201.98k3.96k5.93k7.91k9.89k11.87k13.85k15.82k17.8k19.78k
Difference in amount with 3 Clusters and 0.34 Score
Cluster 0Cluster 1Cluster 203570105140175210245280315350
Difference in recency with 3 Clusters and 0.34 Score
Cluster 0Cluster 1Cluster 2036912151821242730
Difference in frequency with 3 Clusters and 0.34 Score
Cluster 0Cluster 1Cluster 2Cluster 3Cluster 403.47k6.94k10.4k13.87k17.34k20.81k24.28k27.74k31.21k34.68k
Difference in amount with 5 Clusters and 0.30 Score
Cluster 0Cluster 1Cluster 2Cluster 3Cluster 403672108144180216252288324360
Difference in recency with 5 Clusters and 0.30 Score
Cluster 0Cluster 1Cluster 2Cluster 3Cluster 4036912151821242730333639
Difference in frequency with 5 Clusters and 0.30 Score
Cluster 0Cluster 1Cluster 2Cluster 3Cluster 4Cluster 5Cluster 606.06k12.13k18.19k24.25k30.32k36.38k42.44k48.5k54.57k60.63k
Difference in amount with 7 Clusters and 0.31 Score
Cluster 0Cluster 1Cluster 2Cluster 3Cluster 4Cluster 5Cluster 603672108144180216252288324360
Difference in recency with 7 Clusters and 0.31 Score
Cluster 0Cluster 1Cluster 2Cluster 3Cluster 4Cluster 5Cluster 60510152025303540455055
Difference in frequency with 7 Clusters and 0.31 Score

Cross Selling

Market Basket Analysis with Association Rule-Mining

Apriori
FP Growth

Building Transaction Dataset

In [17]:

Pruning Dataset for frequently purchased items

We saw in the earlier on EDA how only a handful of items are responsible for bulk of our sales so we want to prune our dataset to reflect this information.

In [18]:

We use the second option of pruning, by the Top 15th products in sales events.

In [19]:
In [20]:
Total of Sales Amount by the Top 15 Products in Sales Events (Invoice): 777096.12
Number of Sales Events: 4664
Number of Products: 15
Out[20]:
item_name item_count
0 WHITE HANGING HEART T-LIGHT HOLDER 1978
1 REGENCY CAKESTAND 3 TIER 1703
2 JUMBO BAG RED RETROSPOT 1600
3 PARTY BUNTING 1379
4 ASSORTED COLOUR BIRD ORNAMENT 1375
5 LUNCH BAG RED RETROSPOT 1289
6 SET OF 3 CAKE TINS PANTRY DESIGN 1146
7 POSTAGE 1099
8 JUMBO BAG VINTAGE DOILY 1080
9 LUNCH BAG BLACK SKULL. 1052
10 LUNCH BAG SUKI DESIGN 1043
11 POPCORN HOLDER 1035
12 PACK OF 72 RETROSPOT CAKE CASES 1029
13 SPOTTY BUNTING 1009
14 LUNCH BAG VINTAGE DOILY 1006

So we find out that we have 15 items responsible for 8,73% of sales amount and close to 5% of the events result in 4.664 transactions that have those items along with other items. The next step is to convert this selected data into the required table data structure.

Association Rule Mining with FP Growth

In [21]:
In [22]:
In [23]:
In [24]:
num of required transactions =  46
Items Set Size: 663273
In [25]:
Raw rules data frame of 13831 rules generated
Explore The Association Rule Created

Let's see what we get in the first 5 rules with highest confidence:

In [26]:
Out[26]:
consequent antecedent support confidence lift
11 JUMBO BAG VINTAGE DOILY JUMBO BAG RED RETROSPOT, LUNCH BAG RED RETROSPOT, LUNCH BAG VINTAGE DOILY 60 0.909091 4.520256
72 LUNCH BAG VINTAGE DOILY LUNCH BAG RED RETROSPOT, JUMBO BAG VINTAGE DOILY , LUNCH BAG BLACK SKULL., LUNCH BAG SUKI DESIGN 49 0.890909 4.596460
73 LUNCH BAG VINTAGE DOILY LUNCH BAG RED RETROSPOT, JUMBO BAG VINTAGE DOILY , LUNCH BAG SUKI DESIGN 48 0.888889 4.586037
70 LUNCH BAG VINTAGE DOILY LUNCH BAG RED RETROSPOT, JUMBO BAG VINTAGE DOILY , LUNCH BAG BLACK SKULL. 48 0.872727 4.502655
12 JUMBO BAG VINTAGE DOILY JUMBO BAG RED RETROSPOT, LUNCH BAG SUKI DESIGN , LUNCH BAG VINTAGE DOILY 48 0.872727 4.339446

Now, the first 5 higest support:

In [27]:
Out[27]:
consequent antecedent support confidence lift
62 LUNCH BAG VINTAGE DOILY JUMBO BAG VINTAGE DOILY , LUNCH BAG RED RETROSPOT 154 0.777778 4.012783
16 JUMBO BAG VINTAGE DOILY LUNCH BAG VINTAGE DOILY , JUMBO BAG RED RETROSPOT 153 0.805263 4.003995
1 JUMBO BAG RED RETROSPOT JUMBO BAG VINTAGE DOILY , LUNCH BAG RED RETROSPOT 132 0.709677 2.737746
18 JUMBO BAG VINTAGE DOILY LUNCH BAG VINTAGE DOILY , LUNCH BAG RED RETROSPOT, JUMBO BAG RED RETROSPOT 116 0.840580 4.179599
63 LUNCH BAG VINTAGE DOILY JUMBO BAG VINTAGE DOILY , LUNCH BAG RED RETROSPOT, JUMBO BAG RED RETROSPOT 116 0.763158 3.937354
In [28]:
Out[28]:
Greater Than One    13831
Name: lift, dtype: int64